Indels and Structural variants
Learning objectives
1. Understand the different types of structural variants
2. Learn the different algorithms to detect structural variants in
each sequencing technology
3. Annotate structural variants in VCF format
4. Visualize structural variants in IGV
5. Merge structural variants in VCF format
6. Benchmark structural variants
Types of genetic variation
Single-nucleotide variants (SNVs)
Single-nucleotide polymorphisms (SNPs)
ctctgag
ctccgag
Insertion-deletion polymorphisms (INDELs)
ctctgag
ctc--ag
Structural variants (SVs)
ctc ag
ctcaag
Differences between Indels and SVs are often blurry. In general:
Biology
Indels: Replication slippage
SVs: Recombination issues (nonallelic homologous recombination) and DNA double-strand break repair (non-homologous end-joining, NHEJ)
Size (historical convention from short-read alignment)
Indels: < 50 bp
SVs: ≥ 50 bp
Detection
Indels: Gaps and insertions in the alignment process, detected by variant callers
SVs: Specialized tools to detect signals of SVs (although long-read mappers can produce large gaps)
Indels
In(sertion)Del(etion)
A mutation that results from the gain or loss of a sequence
Deletion
Reference TCCAGCAATCAGCGTCAAGCTT
Sample    TCAAGCAA---GCGTCAAGCAA
Insertion
Reference TCCAGCAA---GCGTCAAGCTT
Sample    TCAAGCAA(TCA)GCGTCAAGCAA
Indels
Replication slippage
Indels
Left alignment of the variants (bcftools norm)
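The same indel inside a repeat can be written at several positions, so tools such as bcftools norm shift it to the leftmost one. The sketch below illustrates that idea in Python for a single record; the function name `left_align` and its arguments are hypothetical, and this is not the bcftools implementation.

```python
def left_align(reference, pos, ref, alt):
    """Left-align and trim one variant. pos is 1-based, as in VCF."""
    while True:
        changed = False
        # Drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A 2 bp deletion reported in the middle of a CA repeat moves to its left edge.
print(left_align("GGGCACACACAGGG", 8, "CAC", "C"))
```

Left-aligning both the call set and the benchmark makes the same indel comparable between callers.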
Indel realignment
Local realignment around indels
Correct mapping errors -> more precise indel discovery
[Figure: read alignments before and after indel realignment]
Indel realignment tools
GATK
ABRA
java -Xmx16G -jar $abra_jar \
  --in <input_bam_file> \
  --out <output_bam_file> \
  --ref <reference_genome> \
  --threads <n> \
  --tmpdir <dir> \
  > <file>.log
Structural variants
[Figure: types of structural variants shown as rearrangements of the reference segments A, B, C and D: duplication, inversion, deletion, insertion and translocation]
Structural variants
Nonallelic homologous recombination
Double-strand break repair
Non-homologous end-joining (NHEJ)
Homologous recombination (HR)
[Figure: repair of a double-strand break by NHEJ and by HR]
Structural variants
Evidence of structural differences (Ref. vs. Exp.), in order of resolution from low to high:
(a) Depth of coverage
(b) Paired-end mapping
(c) Split-read mapping
(d) De novo assembly
Structural variants - Deletions
Depth of coverage
A drop in coverage is a sign of a deletion
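As an illustration of the depth-of-coverage signal, the sketch below scans a list of per-position depths and reports runs that fall well below the typical depth. The function name, the 0.5 fraction and the minimum run length are assumptions chosen for the example, not values from a specific caller.

```python
from statistics import median


def find_low_coverage_runs(depths, fraction=0.5, min_length=3):
    """Return 0-based, half-open (start, end) runs with depth < fraction * median depth."""
    threshold = fraction * median(depths)
    runs, start = [], None
    for i, depth in enumerate(depths):
        if depth < threshold:
            if start is None:
                start = i  # a low-coverage run begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the region.
    if start is not None and len(depths) - start >= min_length:
        runs.append((start, len(depths)))
    return runs


depths = [30, 31, 29, 30, 2, 1, 0, 1, 30, 29, 31, 30]
print(find_low_coverage_runs(depths))
```

A heterozygous deletion is expected to show roughly half the usual depth rather than none, so the fraction would need tuning for real data.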
Structural variants - Paired-end sequencing
1. DNA or cDNA fragment
2. Read from each end (forward read F and reverse read R)
3. Align to reference
Insert size (or template length): measured on the sequenced fragment
Inferred insert size (or observed template length): measured between the aligned reads on the reference
Structural variants - Deletions
Insert size
[Figure: reads A and B from one fragment of the subject map farther apart on the reference genome when the subject carries a deletion between them]
Inferred insert size is greater than the expected insert size (long insert size)
Pairs with larger than expected insert size are colored red in IGV
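One way to decide what "larger than expected" means is to compare each pair with the bulk of the insert-size distribution. The sketch below uses the median and the median absolute deviation (MAD); the function name and the cutoff of 5 MADs are assumptions for illustration, and real callers use their own models.

```python
from statistics import median


def flag_long_inserts(insert_sizes, n_mads=5):
    """Return indices of pairs whose insert size exceeds median + n_mads * MAD."""
    center = median(insert_sizes)
    mad = median(abs(size - center) for size in insert_sizes)
    cutoff = center + n_mads * mad
    return [i for i, size in enumerate(insert_sizes) if size > cutoff]


sizes = [300, 310, 295, 305, 290, 2300, 300, 298]
print(flag_long_inserts(sizes))
```

Robust statistics are used here so that the discordant pairs themselves do not inflate the expected range.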
Structural variants - Deletions
Split reads
Reads spanning the breakpoints get split when mapped to the reference genome, with the parts mapped to both sides of the breakpoints -> precise breakpoint detection
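The split-read signal can be turned into breakpoints directly. The sketch below assumes the two aligned pieces of one read are already available as hypothetical `(chrom, ref_start, ref_end, strand)` tuples with 0-based, half-open coordinates, and that the first tuple is the piece lying to the left on the reference; it is a simplification of what SV callers do.

```python
def deletion_from_split_read(left_segment, right_segment):
    """Infer a deletion interval from two aligned pieces of one read.

    Returns (chrom, start, end) of the reference gap, or None when the
    pieces do not look like a simple deletion.
    """
    l_chrom, _, l_end, l_strand = left_segment
    r_chrom, r_start, _, r_strand = right_segment
    if l_chrom != r_chrom or l_strand != r_strand:
        return None  # different chromosome or orientation
    if r_start <= l_end:
        return None  # no reference gap between the two pieces
    return (l_chrom, l_end, r_start)


print(deletion_from_split_read(("chr1", 1000, 1060, "+"), ("chr1", 1500, 1540, "+")))
```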
Structural variants - Deletions
Signals of a deletion:
1. Depth of coverage
2. Paired-end mapping
3. Split read
Structural variants - Insertion
1. Paired-end mapping
Shorter insert size than expected is a sign of an insertion
2. Split read
Split reads with an unmapped region are a sign of a read spanning an insertion
Structural variants - Duplication
1. Coverage
Increase in coverage is a sign of a duplication
[Figure: coverage of a normal and a tumor sample across a duplication]
2. Split read
Split reads with overlapping sequences are a sign of a tandem duplication
Structural variants
Orientation of paired reads can reveal evidence of structural events, including:
Inversions
Duplications
Translocations
Structural variants - Inversions
[Figure: reads A and B from the subject span an inversion breakpoint and map to the reference genome with the same orientation, either forward/forward or reverse/reverse]
Paired reads with the same orientation indicate the breakpoint of an inversion
Pairs with same forward orientation
Pairs with same reverse orientation
Note drop in coverage at breakpoints
[Figure: an inversion in PacBio and Illumina alignments. Credit: Aaron Wenger, PacBio]
Long reads can span long deletions without splitting the read! They will show long gaps.
Structural variants
Paired-end signatures, test genome vs. reference genome (Quinlan and Hall, 2012):
concordant (+/-)
too big (+/-) = deletion
too small (+/-) = spanned insertion
everted (-/+) = tandem duplication
same strand (+/+ or -/-) = inversion
[Figure: paired-end and split-read signatures, sample vs. reference (Quinlan and Hall, 2012)]
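The paired-end signatures above can be written as a small decision function. In this sketch the function name, the strand encoding and the `min_size`/`max_size` bounds of the expected insert-size range are assumptions; both mates are assumed to map to the same chromosome, and mate 1 is the leftmost mate on the reference.

```python
def classify_pair(strand1, strand2, insert_size, min_size, max_size):
    """Classify a read pair from mate orientation and inferred insert size."""
    if strand1 == strand2:
        return "inversion"  # same strand: +/+ or -/-
    if (strand1, strand2) == ("-", "+"):
        return "tandem duplication"  # everted pair
    if insert_size > max_size:
        return "deletion"  # too big
    if insert_size < min_size:
        return "spanned insertion"  # too small
    return "concordant"


for pair in [("+", "-", 300), ("+", "-", 2300), ("+", "-", 80), ("-", "+", 300), ("+", "+", 300)]:
    print(pair, classify_pair(*pair, min_size=200, max_size=400))
```

A single discordant pair is weak evidence; callers typically require several pairs supporting the same event.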
Structural variant annotation
CGTGTtgtagtaCCGTAA Reference
CGTGT-------CCGTAA Sample
1. Direct sequence notation
#CHROM  POS  ID  REF       ALT    QUAL  FILTER  INFO     FORMAT
chr1    5    .   TTGTAGTA  T      60    PASS    .        GT
2. Symbolic notation
#CHROM  POS  ID  REF       ALT    QUAL  FILTER  INFO     FORMAT  SAMPLE
chr1    5    .   T         <DEL>  60    PASS    SVLEN=7  GT      1/1
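The two notations describe the same event, so one can be derived from the other. The sketch below converts a deletion written in direct sequence notation into the fields of the symbolic form; the function name and the returned dictionary are hypothetical. SVLEN is kept positive to match the example above, although some VCF versions and callers report deletions with a negative SVLEN.

```python
def deletion_to_symbolic(pos, ref, alt):
    """Convert a deletion anchored on its first base to symbolic-notation fields."""
    if len(alt) != 1 or len(ref) < 2 or ref[0] != alt:
        raise ValueError("expected a deletion anchored on its first base")
    deleted = len(ref) - 1  # number of deleted bases, excluding the anchor base
    return {
        "POS": pos,
        "REF": ref[0],
        "ALT": "<DEL>",
        "SVLEN": deleted,
        "END": pos + deleted,  # last deleted position, 1-based
    }


print(deletion_to_symbolic(5, "TTGTAGTA", "T"))
```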
Assignment 2
We will compare Indel and SV calling for three sequencing technologies:
1. Illumina
2. PacBio
3. Nanopore
The benchmarking will be done using the well-characterized HG002 sample
Assignment 2 – Intersecting SV
Step 1: Convert the VCF file to a BED file
VCF
Chromosome Position ID Reference Quality Filter Info Format Genotype
1 10415 ACCCTAACCCTAACCCTAACCCTAAC . . GT 1/1
1 62297 T . . GT 1/1
BED (start, end)
10414 10440
62296 62297
Step 2: Convert the benchmark VCF file to a BED file
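The BED coordinates in the example follow from the VCF position and the length of the reference allele: BED is 0-based and half-open, while VCF positions are 1-based. A minimal sketch of the conversion, with a hypothetical function name, could look like this:

```python
def vcf_record_to_bed(chrom, pos, ref):
    """Return (chrom, start, end) as a 0-based, half-open BED interval."""
    start = pos - 1
    return (chrom, start, start + len(ref))


print(vcf_record_to_bed("1", 10415, "ACCCTAACCCTAACCCTAACCCTAAC"))
print(vcf_record_to_bed("1", 62297, "T"))
```

This only works for sequence-resolved records; for symbolic alleles the end coordinate would have to come from the INFO field instead of the length of REF.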
Assignment 2 – Intersecting SV
Step 3: Intersect coordinates of SV
Example: one SV spans 10-20 and the other spans 15-25
Percent overlap
If percent overlap > x% -> Same SV
* Reciprocal overlap: the overlap must cover more than x% of each of the two variants
Example: one SV spans 1-50 and the other spans 30-55
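To make the reciprocal criterion concrete, the sketch below computes the overlap as a fraction of each interval and keeps the smaller of the two. The function names are hypothetical and the intervals are treated as 0-based, half-open BED coordinates.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions (overlap / length of each interval)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    a_len, b_len = a_end - a_start, b_end - b_start
    if a_len <= 0 or b_len <= 0:
        return 0.0
    return min(overlap / a_len, overlap / b_len)


def same_sv(a, b, min_fraction=0.5):
    """Treat two (start, end) intervals as the same SV if both overlap fractions reach min_fraction."""
    return reciprocal_overlap(*a, *b) >= min_fraction


print(same_sv((10, 20), (15, 25)))  # overlap of 5 covers half of each interval
print(same_sv((1, 50), (30, 55)))   # covers most of the second interval but under half of the first
```

In the second example a one-sided test on the shorter interval would pass, which is why the reciprocal requirement is the stricter and safer choice for benchmarking.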
bedtools intersect \
  -a sample.bed \
  -b benchmark.bed \
  -f 0.50 \
  -r
Filtering SV by length
Most structural variant callers will add an INFO/SVLEN field that
can be used for filtering SVs by length
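As a sketch of how INFO/SVLEN can drive a length filter, the functions below parse the INFO column of one record and keep it only if the absolute length reaches a minimum. The function names and the 50 bp default are assumptions; the absolute value is used because deletions are often reported with a negative SVLEN.

```python
def svlen_from_info(info):
    """Return INFO/SVLEN as an int, or None if the field is absent or missing."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVLEN" and value not in ("", "."):
            return int(value.split(",")[0])  # first ALT allele only
    return None


def keep_by_length(info, min_length=50):
    """Keep a record when |SVLEN| is at least min_length."""
    svlen = svlen_from_info(info)
    return svlen is not None and abs(svlen) >= min_length


print(keep_by_length("SVTYPE=DEL;SVLEN=-120;END=1120"))
print(keep_by_length("SVLEN=7"))
```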
Filtering Indels by length
BCFtools can be used to easily filter Indels by length using "ILEN".
ILEN will count up the length of the indel for you: positive numbers are for insertions, and negative are deletions.
# To keep only indels of 5 bp or shorter (removes longer indels):
bcftools view -i '(ILEN >= -5 && ILEN <= 5)' <vcf_file>
TYPE is a built-in that will determine whether a variant is an indel, snp, etc. on-the-fly, so you can use OR logic to also include variants that aren't indels (e.g. SNPs), as in the second command. Note that otherwise, ILEN explicitly excludes anything that's not an indel.
# To keep only indels of 5 bp or shorter and preserve SNPs:
bcftools view -i '(ILEN >= -5 && ILEN <= 5) || TYPE!="indel"' <vcf_file>
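The logic of the second command can be mirrored in a few lines of Python, which may help when checking what a filter keeps. This is a sketch of the idea only, with hypothetical function names, and it treats every variant with equal REF and ALT lengths as a non-indel.

```python
def ilen(ref, alt):
    """Indel length: positive for insertions, negative for deletions, None otherwise."""
    diff = len(alt) - len(ref)
    return diff if diff != 0 else None


def keep_variant(ref, alt, max_indel=5):
    """Keep non-indels and indels whose length is at most max_indel."""
    length = ilen(ref, alt)
    return length is None or abs(length) <= max_indel


print(ilen("T", "TAC"), keep_variant("T", "TAC"))
print(ilen("TTGTAGTA", "T"), keep_variant("TTGTAGTA", "T"))
print(ilen("T", "C"), keep_variant("T", "C"))
```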